Group 17

Assignment 1

The SVG/PDF below is the result of modifying the figure in Inkscape.

Assignment 2

Data set SENIC describes the result of measurements taken at different US hospitals. The description of the variables is given in the accompanying document SENIC.pdf.

  1. Read data from SENIC.txt into R.

The data is loaded and the first 6 rows are shown below.

##   ID    X1   X2  X3   X4    X5  X6 X7 X8  X9 X10 X11
## 1  1  7.13 55.7 4.1  9.0  39.6 279  2  4 207 241  60
## 2  2  8.82 58.2 1.6  3.8  51.7  80  2  2  51  52  40
## 3  3  8.34 56.9 2.7  8.1  74.0 107  2  3  82  54  20
## 4  4  8.95 53.7 5.6 18.9 122.8 147  2  4  53 148  40
## 5  5 11.20 56.5 5.7 34.5  88.9 180  2  1 134 151  40
## 6  6  9.76 50.9 5.1 21.9  97.0 150  2  2 147 106  40
  2. Create a function that for a given column (vector) X does the following:

The function declared below satisfies both requirements.

outliers = function(vec){
  # Stop condition for the outliers function.
  if (!(is.vector(vec) && is.numeric(vec))){
    stop("Either the object is not a vector or is not numeric.")
  }
  
  # Getting the length of the vector in
  # order to construct the indices.
  n = length(vec)
  idxs = 1:n # Indices.
  
  # Getting the quantiles of the vector
  # to calculate outliers.
  quantiles = quantile(vec)
  q1 = quantiles[[2]]  # First quartile (25%).
  q3 = quantiles[[4]]  # Third quartile (75%).
  
  # Constructing the boolean mask that
  # is going to be used to construct the
  # indices.
  mask = (vec > q3 + 1.5 * (q3 - q1)) | (vec < q1 - 1.5 * (q3 - q1))
  idxs = idxs[mask]
  
  return(idxs)
}
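
As a quick sanity check of the 1.5 * IQR rule used above, the sketch below applies a compact variant of the function to a toy vector. The name `iqr_outlier_idx`, the parameter `k`, and the toy data are illustrative assumptions and not part of the assignment; the rule itself is the same one that `outliers` implements.

```r
# Hypothetical compact variant of the outlier rule (names are illustrative).
iqr_outlier_idx = function(vec, k = 1.5) {
  stopifnot(is.numeric(vec))
  # First and third quartiles (R's default quantile type 7).
  q = quantile(vec, probs = c(0.25, 0.75), names = FALSE)
  iqr = q[2] - q[1]
  # Indices of values outside [Q1 - k * IQR, Q3 + k * IQR].
  which(vec < q[1] - k * iqr | vec > q[2] + k * iqr)
}

# Toy data: Q1 = 2.25, Q3 = 4.75, so the fences are -1.5 and 8.5.
toy = c(1, 2, 3, 4, 5, 100)
iqr_outlier_idx(toy)  # 6
```

Only the value 100 lies outside the fences, so the function returns its position in the vector rather than the value itself, which is what the plotting code needs to look up the outlying observations.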
  3. Use ggplot2 and the function from step 2 to create a density plot of infection risk in which outliers are plotted as a diamond symbol. Make some analysis of this graph.

For the Infection Risk variable there are 5 outliers (assuming that none of the markers overlap). These outliers are responsible for the extended tails of the distribution. If we ignored them, the kernel density estimate (KDE) would probably have a different shape.

  4. Produce graphs of the same kind as in step 3 but for all other quantitative variables in the data (aes_string() can be useful here). Put these graphs into one (hint: arrangeGrob() in gridExtra package can be used) and make some analysis.

  5. Create a ggplot2 scatter plot showing the dependence of Infection risk on the Number of Nurses where the points are colored by Number of Beds. Is there any interesting information in this plot that was not visible in the plots in step 4? What do you think is a possible danger of having such a color scale?

This plot shows more information than is encoded in the plots from step 4. Here we can see what looks like a cubic relationship between Infection Risk and the Number of Nurses. As the Infection Risk increases, its variance increases as well, so the relationship, even though it exists, is not very clear. We can also argue that there is a limit at around 200 nurses beyond which extra nurses won't reduce the infection risk. As for the Number of Beds, it appears to be related to the Number of Nurses, since the color visibly changes around the 'limit' found for the nurses.

The danger of using such a color scale is that it is heavily affected by outliers. If, for example, most hospitals have around 200 ± 20 beds and there is one big hospital with 800 beds, the scale will make most hospitals look as if they have the same number of beds, since they will all have a similar color, and visual information is lost. Another danger of this particular color scale is that low-valued points are hard to distinguish when they are close to high-valued ones, for example on the right side of the graph.
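
The effect can be put into numbers with a small base R sketch. The bed counts are toy values taken from the example in the paragraph (they are not SENIC data), and `rescale01` is an illustrative helper that mimics how a continuous color scale maps values linearly onto its range. The rank-based alternative is only one possible remedy, shown here as an assumption rather than as what the plot above uses.

```r
# Toy values from the example in the text (not SENIC data).
beds = c(180, 190, 200, 210, 220, 800)

# Illustrative helper: linear position of each value on a [0, 1] scale.
rescale01 = function(x) (x - min(x)) / (max(x) - min(x))

# Linear scale: the five typical hospitals share about 6.5% of the range.
linear_pos = rescale01(beds)
round(linear_pos, 3)  # 0.000 0.016 0.032 0.048 0.065 1.000

# Rank-based scale: values are spread evenly, at the cost of losing
# the information about how far apart they really are.
rank_pos = rescale01(rank(beds))
rank_pos              # 0.0 0.2 0.4 0.6 0.8 1.0
```

With the linear mapping, five of the six hospitals fall within 40/620 of the scale, so they receive nearly identical colors, which is the loss of visual information described above.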

  6. Convert graph from step 3 to Plotly with ggplotly function. What important new functionality have you obtained compared to the graph from step 3? Make some additional analysis of the new graph.

We gained the ability to hover over the density curve and the outlier markers and read off the values they correspond to. We also gained the ability to zoom in and out of the graph, which lets us confirm visually that there are only 5 outliers in the data.

  7. Use data plot-pipeline operator to make a histogram of Infection Risk in which outliers are plotted as diamond symbol. Make this plot in the Plotly directly (i.e. without using ggplot2 functionality). Hint: select(), filter(), and is.element() functions might be useful here.
  8. Write a Shiny app that produces the same kind of plot as in step 4 but in addition include:

Comment on how the graphs change with varying bandwidth and which bandwidth value is optimal from your point of view.

The bandwidth parameter controls how smooth the kernel density estimate (KDE) is. The lower the value, the more faithful the plot is to the data, but the harder it becomes to spot patterns. In this case, we consider 0.38 a reasonable compromise between smoothing and fidelity for the graph.
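
The trade-off can be illustrated outside Shiny with base R's `density()`, which is the estimator that `stat_density()` relies on. This is a minimal sketch on synthetic data (not SENIC); the helper `count_modes` is an illustrative assumption that simply counts local maxima of the estimated curve as a rough measure of how wiggly it is.

```r
# Synthetic sample with two extreme values (not SENIC data).
set.seed(1)
x = c(rnorm(100, mean = 4, sd = 1), 7.5, 8)

# Illustrative helper: number of local maxima of a density estimate.
count_modes = function(d) sum(diff(sign(diff(d$y))) == -2)

# Smaller bandwidths follow the data closely and produce more bumps;
# larger bandwidths smooth the bumps away.
for (bw in c(0.1, 0.38, 1)) {
  d = density(x, bw = bw)
  cat("bw =", bw, "-> local maxima:", count_modes(d), "\n")
}
```

The three bandwidths are the slider's minimum, the value chosen above, and the slider's maximum, so the printed counts give a rough feel for the range the app exposes.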

# TASK 8
library(shiny)

# Creating a vector for the names
# of the variables so it's easier
# for the user to read.
feature_names = c("Length of Stay",
                  "Age",
                  "Infection Risk",
                  "Routine Culturing Ratio",
                  "Routine Chest X-ray Ratio",
                  "Number of Beds",
                  "Medical School Affiliation",
                  "Region",
                  "Average Daily Census",
                  "Number of Nurses",
                  "Available Facilities & Services")

checkbox_list = list()

for (i in 1:length(feature_names)){
  checkbox_list[[i]] = checkboxInput(df_names[(i + 1)], feature_names[i], FALSE)
}

# Creating the UI for the shiny app.
ui = fluidPage(
  sliderInput(inputId="ws", label="Choose bandwidth size", value=1, min=0.1, max=1),
  checkbox_list,
  plotOutput("densPlot")
)

# Server side functions.
server = function(input, output) {
  output$densPlot <- renderPlot({
    graphs = list()
    counter = 1
    for (name in df_names[2:length(df_names)]){
      if (input[[name]] == TRUE)
      {
        graphs[[counter]] = density_outliers(df[, name], name, bw=input$ws)
        counter = counter + 1 
      }
    }
    if(length(graphs)>0){
      g = grid.arrange(grobs=graphs)
      g
    }
  })
}

# Run the application 
shinyApp(ui = ui, server = server)

Appendix

# TASK 1
# Importing the data that we are going to use
# and taking a look at it so that we know it 
# was imported properly.
df = read.table("SENIC.txt")
head(df)

# Changing the names of the columns.
df_names = c("ID", paste0("X", 1:11))

names(df) = df_names
head(df)

# TASK 2
outliers = function(vec){
  # Stop condition for the outliers function.
  if (!(is.vector(vec) && is.numeric(vec))){
    stop("Either the object is not a vector or is not numeric.")
  }
  
  # Getting the length of the vector in
  # order to construct the indices.
  n = length(vec)
  idxs = 1:n # Indices.
  
  
  # Getting the quantiles of the vector
  # to calculate outliers.
  quantiles = quantile(vec)
  q1 = quantiles[[2]]  # First quartile (25%).
  q3 = quantiles[[4]]  # Third quartile (75%).
  
  # Constructing the boolean mask that
  # is going to be used to construct the
  # indices.
  mask = (vec > q3 + 1.5 * (q3 - q1)) | (vec < q1 - 1.5 * (q3 - q1))
  idxs = idxs[mask]
  
  return(idxs)
}

# TASK 3
library(ggplot2)
outliers_idxs = outliers(df[, "X3"])
Y = rep(0, length(outliers_idxs))
X = df[outliers_idxs, "X3"]

g = ggplot() + geom_point(aes(x=X, y=Y), shape=5, size=5) + geom_density(aes(df$X3)) + xlab("X3")
print(g)

# TASK 4
library("grid")
library("gridExtra")

density_outliers = function(vec, name, bw="nrd0")
{
  outliers_idxs = outliers(vec)
  Y = rep(0, length(outliers_idxs))
  X = vec[outliers_idxs]
  
  g = ggplot() +  stat_density(aes(vec), bw=bw) + geom_point(aes(x=X, y=Y), shape=5, size=3) + xlab(name)
  
  return(g)
}

graphs = list()
counter = 1

for (name in df_names[2:length(df_names)]){
  graphs[[counter]] = density_outliers(df[, name], name)
  counter = counter + 1
}

grid.arrange(grobs=graphs, ncol=4)

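# TASK 5
# Number of Beds (X6) is mapped to color inside aes() so that ggplot2
# builds a continuous color scale with a legend. The plot is stored
# under its own name so that `g` still holds the Task 3 plot for Task 6.
g_scatter = ggplot(df, aes(x=X3, y=X10, color=X6)) + geom_point(shape=1, size=3)
g_scatter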
# TASK 6
library("plotly")
ggplotly(g)

# TASK 7
p = plot_ly() %>% 
  add_histogram(x = df[, "X3"]) %>%
  add_trace(x=X, y=Y, mode="markers", type="scatter", marker=list(symbol="diamond", size=10)) %>%
  layout(bargap = 0.05)
p

# TASK 8
library(shiny)

# Creating a vector for the names
# of the variables so it's easier
# for the user to read.
feature_names = c("Length of Stay",
                  "Age",
                  "Infection Risk",
                  "Routine Culturing Ratio",
                  "Routine Chest X-ray Ratio",
                  "Number of Beds",
                  "Medical School Affiliation",
                  "Region",
                  "Average Daily Census",
                  "Number of Nurses",
                  "Available Facilities & Services")

checkbox_list = list()

for (i in 1:length(feature_names)){
  checkbox_list[[i]] = checkboxInput(df_names[(i + 1)], feature_names[i], FALSE)
}

# Creating the UI for the shiny app.
ui = fluidPage(
  sliderInput(inputId="ws", label="Choose bandwidth size", value=1, min=0.1, max=1),
  checkbox_list,
  plotOutput("densPlot")
)

# Server side functions.
server = function(input, output) {
  output$densPlot <- renderPlot({
    graphs = list()
    counter = 1
    for (name in df_names[2:length(df_names)]){
      if (input[[name]] == TRUE)
      {
        graphs[[counter]] = density_outliers(df[, name], name, bw=input$ws)
        counter = counter + 1 
      }
    }
    if(length(graphs)>0){
      g = grid.arrange(grobs=graphs)
      g
    }
  })
}

# Run the application 
shinyApp(ui = ui, server = server)